
SF-TMN: SlowFast Temporal Modeling Network for Surgical Phase Recognition (2306.08859v1)

Published 15 Jun 2023 in cs.CV and cs.RO

Abstract: Automatic surgical phase recognition is one of the key technologies supporting Video-Based Assessment (VBA) systems for surgical education. Utilizing temporal information is crucial for surgical phase recognition, so various recent approaches extract frame-level features and perform temporal modeling over the full video. For better temporal modeling, we propose the SlowFast Temporal Modeling Network (SF-TMN) for surgical phase recognition, which achieves full-video temporal modeling at both the frame level and the segment level. We employ a feature extraction network, pre-trained on the target dataset, to extract features from video frames as the training data for SF-TMN. The Slow Path in SF-TMN utilizes all frame features for frame-level temporal modeling. The Fast Path in SF-TMN utilizes segment-level features summarized from frame features for segment-level temporal modeling. The proposed paradigm is flexible regarding the choice of temporal modeling networks. We explore MS-TCN and ASFormer as temporal modeling networks and experiment with multiple combination strategies for the Slow and Fast Paths. We evaluate SF-TMN on the Cholec80 surgical phase recognition task and demonstrate that SF-TMN achieves state-of-the-art results on all considered metrics. SF-TMN with an ASFormer backbone outperforms the state-of-the-art Not End-to-End (TCN) method by 2.6% in accuracy and 7.4% in Jaccard score. We also evaluate SF-TMN on the action segmentation datasets 50Salads, GTEA, and Breakfast, and achieve state-of-the-art results. The improvement in results shows that combining temporal information from both the frame level and the segment level, and refining the outputs with temporal refinement stages, is beneficial for the temporal modeling of surgical phases.
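The relationship between the two paths can be illustrated with a minimal sketch. The abstract does not specify how segment-level features are summarized or how the paths are combined, so the code below assumes average pooling over fixed-length windows, nearest-segment upsampling, and element-wise averaging. The function names (`summarize_segments`, `upsample_to_frames`, `combine_paths`) and the `segment_len` parameter are hypothetical, and the temporal modeling networks themselves (MS-TCN or ASFormer) are omitted.

```python
from typing import List

Vector = List[float]


def summarize_segments(frame_feats: List[Vector], segment_len: int) -> List[Vector]:
    """Average-pool consecutive frame features into segment-level features.

    Average pooling is an assumption; the abstract only says segment
    features are "summarized from frame features".
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    segments: List[Vector] = []
    for start in range(0, len(frame_feats), segment_len):
        chunk = frame_feats[start:start + segment_len]  # last chunk may be shorter
        dim = len(chunk[0])
        segments.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return segments


def upsample_to_frames(segment_feats: List[Vector], segment_len: int,
                       n_frames: int) -> List[Vector]:
    """Repeat each segment feature so it aligns with the original frame indices."""
    last = len(segment_feats) - 1
    return [segment_feats[min(t // segment_len, last)] for t in range(n_frames)]


def combine_paths(slow: List[Vector], fast: List[Vector]) -> List[Vector]:
    """Element-wise average of the two paths (one possible combination strategy)."""
    return [[(s + f) / 2 for s, f in zip(sv, fv)] for sv, fv in zip(slow, fast)]


# Toy input: 5 frames with 2-dimensional features (hypothetical values).
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0]]

slow = frames                                       # Slow Path input: all frame features
segments = summarize_segments(frames, segment_len=2)  # Fast Path input: segment features
fast = upsample_to_frames(segments, 2, len(frames))
fused = combine_paths(slow, fast)
```

In a full model, each path would pass through its own temporal modeling network before the combination step, and the combined output would then go through the temporal refinement stages; the sketch only shows how frame-level and segment-level sequences can be derived from the same features and realigned.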

[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Jin, Y., Long, Y., Gao, X., Stoyanov, D., Dou, Q., Heng, P.-A.: Trans-svnet: hybrid embedding aggregation transformer for surgical workflow analysis. International Journal of Computer Assisted Radiology and Surgery 17(12), 2193–2202 (2022) Zhang et al. [2022] Zhang, B., Abbing, J., Ghanem, A., Fer, D., Barker, J., Abukhalil, R., Goel, V.K., Milletarì, F.: Towards accurate surgical workflow recognition with convolutional networks and transformers. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 10(4), 349–356 (2022) Kirtac et al. [2022] Kirtac, K., Aydin, N., Lavanchy, J.L., Beldi, G., Smit, M., Woods, M.S., Aspart, F.: Surgical phase recognition: From public datasets to real-world data. Applied Sciences 12(17), 8746 (2022) Demir et al. [2022] Demir, K.C., Schieber, H., Weise, T., Roth, D., Maier, A., Yang, S.H.: Deep learning in surgical workflow analysis: A review (2022) Valderrama et al. [2022] Valderrama, N., Ruiz Puentes, P., Hernández, I., Ayobi, N., Verlyck, M., Santander, J., Caicedo, J., Fernández, N., Arbeláez, P.: Towards holistic surgical scene understanding. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII, pp. 442–452 (2022). Springer Goldbraikh et al. [2023] Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future ms-tcn++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). Springer Konduri and Rao [2023] Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023) Zang et al. 
[2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. 
[2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Abbing, J., Ghanem, A., Fer, D., Barker, J., Abukhalil, R., Goel, V.K., Milletarì, F.: Towards accurate surgical workflow recognition with convolutional networks and transformers. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 10(4), 349–356 (2022) Kirtac et al. [2022] Kirtac, K., Aydin, N., Lavanchy, J.L., Beldi, G., Smit, M., Woods, M.S., Aspart, F.: Surgical phase recognition: From public datasets to real-world data. Applied Sciences 12(17), 8746 (2022) Demir et al. [2022] Demir, K.C., Schieber, H., Weise, T., Roth, D., Maier, A., Yang, S.H.: Deep learning in surgical workflow analysis: A review (2022) Valderrama et al. [2022] Valderrama, N., Ruiz Puentes, P., Hernández, I., Ayobi, N., Verlyck, M., Santander, J., Caicedo, J., Fernández, N., Arbeláez, P.: Towards holistic surgical scene understanding. 
In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII, pp. 442–452 (2022). Springer Goldbraikh et al. [2023] Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future ms-tcn++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). Springer Konduri and Rao [2023] Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023) Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. 
[2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. 
Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. 
[2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Demir, K.C., Schieber, H., Weise, T., Roth, D., Maier, A., Yang, S.H.: Deep learning in surgical workflow analysis: A review (2022) Valderrama et al. [2022] Valderrama, N., Ruiz Puentes, P., Hernández, I., Ayobi, N., Verlyck, M., Santander, J., Caicedo, J., Fernández, N., Arbeláez, P.: Towards holistic surgical scene understanding. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII, pp. 442–452 (2022). Springer Goldbraikh et al. 
[2023] Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future ms-tcn++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). Springer Konduri and Rao [2023] Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023) Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. 
[2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future ms-tcn++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). Springer Konduri and Rao [2023] Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023) Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. 
[2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. 
[2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 
2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023) Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. 
International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. 
[2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. 
[2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. 
[2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. 
In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. 
[2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 
2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. 
[2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. 
arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. 
In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. 
[2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. 
In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. 
arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. 
International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 
3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. 
In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). 
Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. 
In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. 
[2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. 
In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. 
In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. 
In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. 
In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 
6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  2. Zia, A., Hung, A., Essa, I., Jarc, A.: Surgical activity recognition in robot-assisted radical prostatectomy using deep learning. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part IV 11, pp. 273–280 (2018). Springer
  3. Jin et al. [2021] Jin, Y., Long, Y., Chen, C., Zhao, Z., Dou, Q., Heng, P.-A.: Temporal memory relation network for workflow recognition from surgical video. IEEE Transactions on Medical Imaging 40(7), 1911–1923 (2021)
  4. Jin et al. [2022] Jin, Y., Long, Y., Gao, X., Stoyanov, D., Dou, Q., Heng, P.-A.: Trans-svnet: hybrid embedding aggregation transformer for surgical workflow analysis. International Journal of Computer Assisted Radiology and Surgery 17(12), 2193–2202 (2022)
  5. Zhang et al. [2022] Zhang, B., Abbing, J., Ghanem, A., Fer, D., Barker, J., Abukhalil, R., Goel, V.K., Milletarì, F.: Towards accurate surgical workflow recognition with convolutional networks and transformers. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 10(4), 349–356 (2022)
  6. Kirtac et al. [2022] Kirtac, K., Aydin, N., Lavanchy, J.L., Beldi, G., Smit, M., Woods, M.S., Aspart, F.: Surgical phase recognition: From public datasets to real-world data. Applied Sciences 12(17), 8746 (2022)
  7. Demir et al. [2022] Demir, K.C., Schieber, H., Weise, T., Roth, D., Maier, A., Yang, S.H.: Deep learning in surgical workflow analysis: A review (2022)
  8. Valderrama et al. [2022] Valderrama, N., Ruiz Puentes, P., Hernández, I., Ayobi, N., Verlyck, M., Santander, J., Caicedo, J., Fernández, N., Arbeláez, P.: Towards holistic surgical scene understanding. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII, pp. 442–452 (2022). Springer
  9. Goldbraikh et al. [2023] Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future ms-tcn++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). Springer
  10. Konduri and Rao [2023] Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023)
  11. Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023)
  12. Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023)
  13. Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023)
  14. Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE Transactions on Medical Imaging 37(5), 1114–1126 (2017)
  15. Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical Image Analysis 59, 101572 (2020)
  16. Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021)
  17. Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer
  18. Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023)
  19. Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR
  20. Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer
  21. Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022)
  22. Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022)
  23. Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019)
  24. He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
  25. Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019)
  26. Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021)
  27. Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022)
  28. Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022)
  29. Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013)
  30. Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE
  31. Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014)
  32. Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
  33. Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756
  34. Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
  35. Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022)
  36. Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022)
  37. Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023)
  38. Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023)
  39. Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE
  40. Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017)
  41. Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022)
  42. van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023)
  43. Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE
  44. Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017)
  45. Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer
  46. Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022)
  47. Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020)
  48. Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer
  49. Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020)
  50. Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021)
  51. Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021)
  52. Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021)
  53. Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021)
  54. Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022)
  55. Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 10(4), 349–356 (2022) Kirtac et al. [2022] Kirtac, K., Aydin, N., Lavanchy, J.L., Beldi, G., Smit, M., Woods, M.S., Aspart, F.: Surgical phase recognition: From public datasets to real-world data. Applied Sciences 12(17), 8746 (2022) Demir et al. [2022] Demir, K.C., Schieber, H., Weise, T., Roth, D., Maier, A., Yang, S.H.: Deep learning in surgical workflow analysis: A review (2022) Valderrama et al. [2022] Valderrama, N., Ruiz Puentes, P., Hernández, I., Ayobi, N., Verlyck, M., Santander, J., Caicedo, J., Fernández, N., Arbeláez, P.: Towards holistic surgical scene understanding. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII, pp. 442–452 (2022). Springer Goldbraikh et al. [2023] Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future ms-tcn++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). Springer Konduri and Rao [2023] Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023) Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. 
[2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. 
In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. 
arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Jin, Y., Long, Y., Gao, X., Stoyanov, D., Dou, Q., Heng, P.-A.: Trans-svnet: hybrid embedding aggregation transformer for surgical workflow analysis. International Journal of Computer Assisted Radiology and Surgery 17(12), 2193–2202 (2022) Zhang et al. [2022] Zhang, B., Abbing, J., Ghanem, A., Fer, D., Barker, J., Abukhalil, R., Goel, V.K., Milletarì, F.: Towards accurate surgical workflow recognition with convolutional networks and transformers. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 10(4), 349–356 (2022) Kirtac et al. [2022] Kirtac, K., Aydin, N., Lavanchy, J.L., Beldi, G., Smit, M., Woods, M.S., Aspart, F.: Surgical phase recognition: From public datasets to real-world data. Applied Sciences 12(17), 8746 (2022) Demir et al. [2022] Demir, K.C., Schieber, H., Weise, T., Roth, D., Maier, A., Yang, S.H.: Deep learning in surgical workflow analysis: A review (2022) Valderrama et al. [2022] Valderrama, N., Ruiz Puentes, P., Hernández, I., Ayobi, N., Verlyck, M., Santander, J., Caicedo, J., Fernández, N., Arbeláez, P.: Towards holistic surgical scene understanding. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII, pp. 442–452 (2022). Springer Goldbraikh et al. [2023] Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future ms-tcn++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). 
Springer Konduri and Rao [2023] Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023) Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). 
Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Abbing, J., Ghanem, A., Fer, D., Barker, J., Abukhalil, R., Goel, V.K., Milletarì, F.: Towards accurate surgical workflow recognition with convolutional networks and transformers. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 10(4), 349–356 (2022) Kirtac et al. [2022] Kirtac, K., Aydin, N., Lavanchy, J.L., Beldi, G., Smit, M., Woods, M.S., Aspart, F.: Surgical phase recognition: From public datasets to real-world data. Applied Sciences 12(17), 8746 (2022) Demir et al. [2022] Demir, K.C., Schieber, H., Weise, T., Roth, D., Maier, A., Yang, S.H.: Deep learning in surgical workflow analysis: A review (2022) Valderrama et al. [2022] Valderrama, N., Ruiz Puentes, P., Hernández, I., Ayobi, N., Verlyck, M., Santander, J., Caicedo, J., Fernández, N., Arbeláez, P.: Towards holistic surgical scene understanding. 
In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII, pp. 442–452 (2022). Springer
Goldbraikh et al. [2023] Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future ms-tcn++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). Springer
Konduri and Rao [2023] Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023)
Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023)
Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023)
Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023)
Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017)
Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020)
Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021)
Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer
Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023)
Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR
Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer
Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022)
Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022)
Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019)
He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019)
Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021)
Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022)
Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022)
Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013)
Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE
Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014)
Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756
Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022)
Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022)
Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023)
Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023)
Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE
Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017)
Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022)
van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023)
Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE
Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017)
Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer
Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022)
Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020)
Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer
Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020)
Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021)
Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021)
Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021)
Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021)
Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022)
Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future ms-tcn++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). Springer Konduri and Rao [2023] Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023) Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. 
[2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. 
[2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 
2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023) Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. 
International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. 
[2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. 
[2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. 
[2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. 
In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. 
[2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 
2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. 
[2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. 
arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. 
In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. 
[2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. 
In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. 
arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. 
International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 
3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. 
In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). 
Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. 
In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. 
[2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. 
In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. 
In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. 
In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. 
In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021)
  3. Jin, Y., Long, Y., Chen, C., Zhao, Z., Dou, Q., Heng, P.-A.: Temporal memory relation network for workflow recognition from surgical video. IEEE Transactions on Medical Imaging 40(7), 1911–1923 (2021)
Jin et al. [2022] Jin, Y., Long, Y., Gao, X., Stoyanov, D., Dou, Q., Heng, P.-A.: Trans-SVNet: hybrid embedding aggregation transformer for surgical workflow analysis. International Journal of Computer Assisted Radiology and Surgery 17(12), 2193–2202 (2022)
Zhang et al. [2022] Zhang, B., Abbing, J., Ghanem, A., Fer, D., Barker, J., Abukhalil, R., Goel, V.K., Milletarì, F.: Towards accurate surgical workflow recognition with convolutional networks and transformers. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 10(4), 349–356 (2022)
Kirtac et al. [2022] Kirtac, K., Aydin, N., Lavanchy, J.L., Beldi, G., Smit, M., Woods, M.S., Aspart, F.: Surgical phase recognition: From public datasets to real-world data. Applied Sciences 12(17), 8746 (2022)
Demir et al. [2022] Demir, K.C., Schieber, H., Weise, T., Roth, D., Maier, A., Yang, S.H.: Deep learning in surgical workflow analysis: A review (2022)
Valderrama et al. [2022] Valderrama, N., Ruiz Puentes, P., Hernández, I., Ayobi, N., Verlyck, M., Santander, J., Caicedo, J., Fernández, N., Arbeláez, P.: Towards holistic surgical scene understanding. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII, pp. 442–452 (2022). Springer
Goldbraikh et al. [2023] Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future MS-TCN++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). Springer
Konduri and Rao [2023] Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023)
Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—AI-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023)
Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: LAST: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023)
Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: LoViT: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023)
Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: SV-RCNet: workflow recognition from surgical videos using recurrent convolutional network. IEEE Transactions on Medical Imaging 37(5), 1114–1126 (2017)
Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical Image Analysis 59, 101572 (2020)
Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3DCNN for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021)
Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: TeCNO: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer
Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels Roux-en-Y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023)
Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: SWNet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR
Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: OperA: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer
Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022)
Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022)
Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: SlowFast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019)
He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.
770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Jin, Y., Long, Y., Gao, X., Stoyanov, D., Dou, Q., Heng, P.-A.: Trans-svnet: hybrid embedding aggregation transformer for surgical workflow analysis. International Journal of Computer Assisted Radiology and Surgery 17(12), 2193–2202 (2022) Zhang et al. [2022] Zhang, B., Abbing, J., Ghanem, A., Fer, D., Barker, J., Abukhalil, R., Goel, V.K., Milletarì, F.: Towards accurate surgical workflow recognition with convolutional networks and transformers. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 10(4), 349–356 (2022) Kirtac et al. [2022] Kirtac, K., Aydin, N., Lavanchy, J.L., Beldi, G., Smit, M., Woods, M.S., Aspart, F.: Surgical phase recognition: From public datasets to real-world data. Applied Sciences 12(17), 8746 (2022) Demir et al. 
[2022] Demir, K.C., Schieber, H., Weise, T., Roth, D., Maier, A., Yang, S.H.: Deep learning in surgical workflow analysis: A review (2022) Valderrama et al. [2022] Valderrama, N., Ruiz Puentes, P., Hernández, I., Ayobi, N., Verlyck, M., Santander, J., Caicedo, J., Fernández, N., Arbeláez, P.: Towards holistic surgical scene understanding. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII, pp. 442–452 (2022). Springer Goldbraikh et al. [2023] Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future ms-tcn++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). Springer Konduri and Rao [2023] Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023) Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. 
IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. 
International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 
3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Abbing, J., Ghanem, A., Fer, D., Barker, J., Abukhalil, R., Goel, V.K., Milletarì, F.: Towards accurate surgical workflow recognition with convolutional networks and transformers. 
Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 10(4), 349–356 (2022) Kirtac et al. [2022] Kirtac, K., Aydin, N., Lavanchy, J.L., Beldi, G., Smit, M., Woods, M.S., Aspart, F.: Surgical phase recognition: From public datasets to real-world data. Applied Sciences 12(17), 8746 (2022) Demir et al. [2022] Demir, K.C., Schieber, H., Weise, T., Roth, D., Maier, A., Yang, S.H.: Deep learning in surgical workflow analysis: A review (2022) Valderrama et al. [2022] Valderrama, N., Ruiz Puentes, P., Hernández, I., Ayobi, N., Verlyck, M., Santander, J., Caicedo, J., Fernández, N., Arbeláez, P.: Towards holistic surgical scene understanding. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII, pp. 442–452 (2022). Springer Goldbraikh et al. [2023] Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future ms-tcn++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). Springer Konduri and Rao [2023] Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023) Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. 
[2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. 
In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. 
arXiv preprint arXiv:2202.08199 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation.
Neural Processing Letters, 1–17 (2022) Zhang et al.
[2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. 
Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. 
[2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. 
Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. 
[2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. 
[2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. 
[2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. 
International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. 
[2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. 
[2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. 
[2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 
2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. 
[2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al.
[2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Kuehne et al.
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. 
[2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. 
arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 
1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  4. Jin, Y., Long, Y., Gao, X., Stoyanov, D., Dou, Q., Heng, P.-A.: Trans-svnet: hybrid embedding aggregation transformer for surgical workflow analysis. International Journal of Computer Assisted Radiology and Surgery 17(12), 2193–2202 (2022)
     Zhang et al. [2022] Zhang, B., Abbing, J., Ghanem, A., Fer, D., Barker, J., Abukhalil, R., Goel, V.K., Milletarì, F.: Towards accurate surgical workflow recognition with convolutional networks and transformers. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 10(4), 349–356 (2022)
     Kirtac et al. [2022] Kirtac, K., Aydin, N., Lavanchy, J.L., Beldi, G., Smit, M., Woods, M.S., Aspart, F.: Surgical phase recognition: From public datasets to real-world data. Applied Sciences 12(17), 8746 (2022)
     Demir et al. [2022] Demir, K.C., Schieber, H., Weise, T., Roth, D., Maier, A., Yang, S.H.: Deep learning in surgical workflow analysis: A review (2022)
     Valderrama et al. [2022] Valderrama, N., Ruiz Puentes, P., Hernández, I., Ayobi, N., Verlyck, M., Santander, J., Caicedo, J., Fernández, N., Arbeláez, P.: Towards holistic surgical scene understanding. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII, pp. 442–452 (2022). Springer
     Goldbraikh et al. [2023] Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future ms-tcn++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). Springer
     Konduri and Rao [2023] Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023)
     Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023)
     Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023)
     Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023)
     Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017)
     Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020)
     Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021)
     Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer
     Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023)
     Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR
     Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer
     Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022)
     Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022)
     Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019)
     He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
     Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019)
     Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021)
     Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022)
     Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022)
     Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013)
     Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE
     Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014)
     Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
     Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756
     Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
     Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022)
     Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022)
     Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023)
     Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023)
     Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE
     Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017)
     Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022)
     van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Abbing, J., Ghanem, A., Fer, D., Barker, J., Abukhalil, R., Goel, V.K., Milletarì, F.: Towards accurate surgical workflow recognition with convolutional networks and transformers. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 10(4), 349–356 (2022) Kirtac et al. [2022] Kirtac, K., Aydin, N., Lavanchy, J.L., Beldi, G., Smit, M., Woods, M.S., Aspart, F.: Surgical phase recognition: From public datasets to real-world data. Applied Sciences 12(17), 8746 (2022) Demir et al. [2022] Demir, K.C., Schieber, H., Weise, T., Roth, D., Maier, A., Yang, S.H.: Deep learning in surgical workflow analysis: A review (2022) Valderrama et al. [2022] Valderrama, N., Ruiz Puentes, P., Hernández, I., Ayobi, N., Verlyck, M., Santander, J., Caicedo, J., Fernández, N., Arbeláez, P.: Towards holistic surgical scene understanding. 
In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII, pp. 442–452 (2022). Springer Goldbraikh et al. [2023] Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future ms-tcn++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). Springer Konduri and Rao [2023] Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023) Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. 
[2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. 
Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. 
[2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Kirtac, K., Aydin, N., Lavanchy, J.L., Beldi, G., Smit, M., Woods, M.S., Aspart, F.: Surgical phase recognition: From public datasets to real-world data. Applied Sciences 12(17), 8746 (2022) Demir et al. [2022] Demir, K.C., Schieber, H., Weise, T., Roth, D., Maier, A., Yang, S.H.: Deep learning in surgical workflow analysis: A review (2022) Valderrama et al. [2022] Valderrama, N., Ruiz Puentes, P., Hernández, I., Ayobi, N., Verlyck, M., Santander, J., Caicedo, J., Fernández, N., Arbeláez, P.: Towards holistic surgical scene understanding. 
In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII, pp. 442–452 (2022). Springer Goldbraikh et al. [2023] Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future ms-tcn++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). Springer Konduri and Rao [2023] Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023) Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. 
[2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. 
Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. 
[2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. 
Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. 
[2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. 
[2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. 
In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. 
[2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 
2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. 
[2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. 
arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. 
In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. 
[2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. 
In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. 
arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. 
International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 
3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. 
In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). 
Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. 
In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. 
[2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. 
In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
  5. Zhang, B., Abbing, J., Ghanem, A., Fer, D., Barker, J., Abukhalil, R., Goel, V.K., Milletarì, F.: Towards accurate surgical workflow recognition with convolutional networks and transformers. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 10(4), 349–356 (2022)
In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. 
arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Valderrama, N., Ruiz Puentes, P., Hernández, I., Ayobi, N., Verlyck, M., Santander, J., Caicedo, J., Fernández, N., Arbeláez, P.: Towards holistic surgical scene understanding. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII, pp. 442–452 (2022). Springer Goldbraikh et al. [2023] Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future ms-tcn++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). Springer Konduri and Rao [2023] Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023) Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. 
[2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. 
[2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. 
In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future ms-tcn++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). Springer Konduri and Rao [2023] Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023) Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. 
[2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756
Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022)
Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022)
Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023)
Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023)
Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE
Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017)
Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022)
van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023)
Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE
Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017)
Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer
Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022)
Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020)
Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer
Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020)
Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021)
Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021)
Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021)
Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021)
Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022)
Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. 
[2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. 
[2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. 
International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. 
[2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. 
In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. 
arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019)
van Amsterdam et al.
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. 
In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  6. Kirtac, K., Aydin, N., Lavanchy, J.L., Beldi, G., Smit, M., Woods, M.S., Aspart, F.: Surgical phase recognition: From public datasets to real-world data. Applied Sciences 12(17), 8746 (2022)
Neural Processing Letters, 1–17 (2022) Valderrama, N., Ruiz Puentes, P., Hernández, I., Ayobi, N., Verlyck, M., Santander, J., Caicedo, J., Fernández, N., Arbeláez, P.: Towards holistic surgical scene understanding. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII, pp. 442–452 (2022). Springer Goldbraikh et al. [2023] Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future ms-tcn++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). Springer Konduri and Rao [2023] Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023) Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. 
[2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. 
[2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future ms-tcn++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). 
Springer Konduri and Rao [2023] Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023) Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). 
Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023) Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. 
[2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. 
[2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. 
International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. 
[2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. 
[2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. 
arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
  7. Demir, K.C., Schieber, H., Weise, T., Roth, D., Maier, A., Yang, S.H.: Deep learning in surgical workflow analysis: A review (2022)
Valderrama et al. [2022] Valderrama, N., Ruiz Puentes, P., Hernández, I., Ayobi, N., Verlyck, M., Santander, J., Caicedo, J., Fernández, N., Arbeláez, P.: Towards holistic surgical scene understanding. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII, pp. 442–452 (2022). Springer
Goldbraikh et al. [2023] Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future ms-tcn++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). Springer
Konduri and Rao [2023] Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023)
Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023)
Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023)
Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023)
Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017)
Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020)
Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021)
Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer
Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023)
Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR
Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer
Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022)
Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022)
Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019)
He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019)
Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021)
Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022)
Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022)
Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013)
Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE
Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014)
Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756
Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022)
Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022)
Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023)
Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023)
Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE
Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017)
Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022)
van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023)
Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE
Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017)
Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer
Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022)
Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020)
Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer
Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020)
Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021)
Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021)
Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021)
Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021)
Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022)
Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021)
Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021)
Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021)
Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021)
Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022)
Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. 
Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. 
[2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. 
[2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. 
[2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. 
International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. 
[2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Neural Processing Letters, 1–17 (2022) Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. 
In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. 
In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. 
arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. 
In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  8. Valderrama, N., Ruiz Puentes, P., Hernández, I., Ayobi, N., Verlyck, M., Santander, J., Caicedo, J., Fernández, N., Arbeláez, P.: Towards holistic surgical scene understanding. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII, pp. 442–452 (2022). Springer
Goldbraikh et al. [2023] Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future ms-tcn++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). Springer
Konduri and Rao [2023] Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023)
Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023)
Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023)
Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023)
Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017)
Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020)
Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021)
Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer
Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023)
Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR
Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer
Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022)
Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022)
Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019)
He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019)
Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021)
Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022)
Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022)
Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013)
Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE
Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014)
Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756
Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022)
Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022)
Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023)
Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023)
Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE
Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017)
Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022)
van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023)
Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE
Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017)
Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer
Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022)
Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020)
Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation.
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer
Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020)
Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021)
Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021)
Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021)
Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021)
Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022)
Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation.
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023) Zang et al. [2023] Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. 
[2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. 
In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. 
arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. 
In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
[2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. 
[2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 
2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. 
[2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. 
[2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. 
arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 
1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
  9. Goldbraikh, A., Avisdris, N., Pugh, C.M., Laufer, S.: Bounded future ms-tcn++ for surgical gesture recognition. In: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pp. 406–421 (2023). Springer
arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—ai-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023) Tao et al. [2023] Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. 
In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023) Liu et al. [2023] Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. 
[2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. 
Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. 
[2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. 
[2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 
2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. 
[2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. 
[2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. 
In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022)
Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer
Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022)
Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022)
Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019)
He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019)
Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021)
Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022)
Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022)
Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013)
Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE
Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014)
Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756
Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022)
Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022)
Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023)
Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023)
Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE
Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017)
Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022)
van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023)
Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE
Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017)
Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer
Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022)
Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020)
Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer
Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020)
Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021)
Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021)
Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021)
Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries.
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021)
Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022)
Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. 
[2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. 
arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 
1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
  10. Konduri, P.S., Rao, G.S.N.: Surgical phase recognition in laparoscopic videos using gated capsule autoencoder model. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–23 (2023)
2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. 
[2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. 
arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. 
In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. 
[2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020)
Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021)
Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer
Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023)
Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. 
arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 
6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  11. Zang, C., Turkcan, M.K., Narasimhan, S., Cao, Y., Yarali, K., Xiang, Z., Szot, S., Ahmad, F., Choksi, S., Bitner, D.P., Filicori, F., Kostic, Z.: Surgical phase recognition in inguinal hernia repair—AI-based confirmatory baseline and exploration of competitive models. Bioengineering 10(6), 654 (2023)
Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. 
[2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023) Jin et al. [2017] Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. 
Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. 
[2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. 
[2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. 
[2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. 
[2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. 
arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 
1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
  12. Tao, R., Zou, X., Zheng, G.: Last: Latent space-constrained transformers for automatic surgical phase recognition and tool presence detection. IEEE Transactions on Medical Imaging (2023)
[2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. 
arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017) Jin et al. [2020] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. 
In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. 
[2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. 
In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. 
arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. 
In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. 
arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. 
In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. 
In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 
6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  13. Liu, Y., Boels, M., Garcia-Peraza-Herrera, L.C., Vercauteren, T., Dasgupta, P., Granados, A., Ourselin, S.: Lovit: Long video transformer for surgical phase recognition. arXiv preprint arXiv:2305.08989 (2023)
[2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. 
In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. 
[2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. 
[2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). 
Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. 
In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. 
[2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. 
In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. 
In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. 
In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. 
In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 
6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  14. Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.-W., Heng, P.-A.: Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2017)
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. [2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. 
International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 
3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021) Czempiel et al. 
[2020] Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer Fer et al. [2023] Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. 
In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. 
[2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. 
Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. 
[2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. 
[2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. 
arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 
1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
  15. Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.-W., Heng, P.-A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical Image Analysis 59, 101572 (2020)
[2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. 
In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. 
In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. 
In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017)
Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. 
In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  16. Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A.: Surgical workflow recognition with 3dcnn for sleeve gastrectomy. International Journal of Computer Assisted Radiology and Surgery 16(11), 2029–2036 (2021)
[2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. 
arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fer, D., Zhang, B., Abukhalil, R., Goel, V., Goel, B., Barker, J., Kalesan, B., Barragan, I., Gaddis, M.L., Kilroy, P.G.: An artificial intelligence model that automatically labels roux-en-y gastric bypasses, a comparison to trained surgeon annotators. Surgical Endoscopy, 1–8 (2023) Zhang et al. [2021] Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. 
Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. 
[2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR Czempiel et al. [2021] Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. 
[2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. 
In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. 
[2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. 
arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 
1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
  17. Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 343–352 (2020). Springer
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. 
[2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. 
[2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. 
[2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. 
arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. 
[2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. 
In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. 
arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. 
In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
Springer Zhang et al. [2022a] Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. 
In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022) Zhang et al. [2022b] Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. 
In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. 
arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. 
In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 
1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? 
a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. 
In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. 
In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. 
In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  19. Zhang, B., Ghanem, A., Simes, A., Choi, H., Yoo, A., Min, A.: Swnet: Surgical workflow recognition with deep convolutional network. In: Medical Imaging with Deep Learning, pp. 855–869 (2021). PMLR
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022) Feichtenhofer et al. [2019] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. 
arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. 
[2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. 
In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer
Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022)
Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020)
Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer
Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020)
Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021)
Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021)
Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021)
Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021)
Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022)
Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023)
Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023)
Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE
Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017)
Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022)
van Amsterdam et al. [2023] van Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023)
Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE
Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017)
Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations.
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  20. Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N.: Opera: Attention-regularized transformers for surgical phase recognition. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 604–614 (2021). Springer
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. 
[2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. 
arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 
1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
  21. Zhang, B., Goel, B., Sarhan, M.H., Goel, V.K., Abukhalil, R., Kalesan, B., Stottler, N., Petculescu, S.: Surgical workflow recognition with temporal convolution and transformer for action segmentation. International Journal of Computer Assisted Radiology and Surgery, 1–10 (2022)
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019) He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. 
arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. 
In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 
1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? 
a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. 
In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. 
In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. 
In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
  22. Zhang, B., Sturgeon, D., Shankar, A.R., Goel, V.K., Barker, J., Ghanem, A., Lee, P., Milecky, M., Stottler, N., Petculescu, S.: Surgical instrument recognition for instrument usage documentation and surgical video library indexing. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2022)
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Farha and Gall [2019] Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. 
arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. 
In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 
1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? 
a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. 
In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. 
In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. 
In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  23. Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211 (2019)
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019) Yi et al. [2021] Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021) Yi et al. [2022] Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. 
In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 
1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? 
a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. 
In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. 
In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. 
In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
  24. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 
  25. Farha, Y.A., Gall, J.: Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019)
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 
2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 
6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  26. Yi, F., Wen, H., Jiang, T.: Asformer: Transformer for action segmentation. arXiv preprint arXiv:2110.08568 (2021)
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022) Wang et al. [2022] Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022) Stein and McKenna [2013] Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) Fathi et al. [2011] Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. 
[2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE Kuehne et al. [2014] Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. 
[2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 
6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  27. Yi, F., Yang, Y., Jiang, T.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 2613–2628 (2022)
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. 
[2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017) Li et al. [2020] Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. 
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. 
[2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. 
In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. 
In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  28. Wang, Z., Ding, X., Zhao, W., Li, X.: Less is more: Surgical phase recognition from timestamp supervision. arXiv preprint arXiv:2202.08199 (2022)
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. 
[2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. 
  29. Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013)
Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE
Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: MS-TCN++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756
Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022)
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 
6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  30. Fathi, A., Ren, X., Rehg, J.M.: Learning to recognize objects in egocentric activities. In: CVPR 2011, pp. 3281–3288 (2011). IEEE
[2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 
6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  31. Kuehne, H., Arslan, A., Serre, T.: The language of actions: Recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014)
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756 Carreira and Zisserman [2017] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017) Wang et al. [2022] Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022) Park et al. [2022] Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. 
In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. [2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. 
In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE
  32. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. 
In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 
  33. Li, S.-J., AbuFarha, Y., Liu, Y., Cheng, M.-M., Gall, J.: Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) https://doi.org/10.1109/TPAMI.2020.3021756
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. 
[2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  34. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022) Liu et al. 
[2023] Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. 
In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023) Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023) Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. 
[2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. 
[2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). 
IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 
6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  35. Wang, J., Wang, Z., Zhuang, S., Wang, H.: Cross-enhancement transformer for action segmentation. arXiv preprint arXiv:2205.09445 (2022)
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE Lea et al. 
[2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. 
[2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 
52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. 
[2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. 
[2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  36. Park, J., Kim, D., Huh, S., Jo, S.: Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recognition 129, 108764 (2022)
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 
6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  37. Liu, D., Li, Q., Dinh, A., Jiang, T., Shah, M., Xu, C.: Diffusion action segmentation. arXiv preprint arXiv:2303.17959 (2023)
  Funke et al. [2023] Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023)
  Lea et al. [2016] Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE
  Lea et al. [2017] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017)
  Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022)
  van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023)
  Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE
  Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017)
  Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer
  Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022)
  Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020)
  Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer
  Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020)
  Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021)
  Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021)
  Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021)
  Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021)
  Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022)
  Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. 
Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. 
[2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. 
In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  38. Funke, I., Rivoir, D., Speidel, S.: Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961 (2023)
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017) Li et al. [2022] Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). 
Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. 
[2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022) van Amsterdam et al. [2023] Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023) Ishihara et al. [2022] Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE Twinanda [2017] Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. 
Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 
6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos. PhD thesis, Strasbourg (2017) Behrmann et al. [2022] Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer Aziere and Todorovic [2022] Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. 
[2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022) Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. 
[2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). 
Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
  39. Lea, C., Vidal, R., Hager, G.D.: Learning convolutional action primitives for fine-grained action recognition. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1642–1649 (2016). IEEE
  40. Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017)
  41. Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., Lu, J.: Bridge-prompt: Towards ordinal action understanding in instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19880–19889 (2022)
  42. Amsterdam, B., Kadkhodamohammadi, A., Luengo, I., Stoyanov, D.: Aspnet: Action segmentation with shared-private representation of multiple data sources. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2384–2393 (2023)
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020) Wang et al. [2020] Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. 
In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer Chen et al. [2020] Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020) Gao et al. [2021] Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021) Ahn and Lee [2021] Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. 
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021) Wang et al. [2021] Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021) Ishikawa et al. [2021] Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. 
In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021) Chen et al. [2022] Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022) Du and Wang [2022] Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022) Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)
  43. Ishihara, K., Nakano, G., Inoshita, T.: Mcfm: Mutual cross fusion module for intermediate fusion-based action segmentation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1701–1705 (2022). IEEE
  44. Twinanda, A.P.: Vision-based approaches for surgical activity recognition using laparoscopic and RGBD videos. PhD thesis, Strasbourg (2017)
  45. Behrmann, N., Golestaneh, S.A., Kolter, Z., Gall, J., Noroozi, M.: Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp. 52–68 (2022). Springer
  46. Aziere, N., Todorovic, S.: Multistage temporal convolution transformer for action segmentation. Image and Vision Computing 128, 104567 (2022)
  47. Chen, M.-H., Li, B., Bao, Y., AlRegib, G., Kira, Z.: Action segmentation with joint self-supervised temporal domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463 (2020)
  48. Wang, Z., Gao, Z., Wang, L., Li, Z., Wu, G.: Boundary-aware cascade networks for temporal action segmentation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp. 34–51 (2020). Springer
  49. Chen, M.-H., Li, B., Bao, Y., AlRegib, G.: Action segmentation with mixed temporal domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 605–614 (2020)
  50. Gao, S.-H., Han, Q., Li, Z.-Y., Peng, P., Wang, L., Cheng, M.-M.: Global2local: Efficient structure search for video action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16805–16814 (2021)
  51. Ahn, H., Lee, D.: Refining action segmentation with hierarchical video representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16302–16310 (2021)
  52. Wang, D., Hu, D., Li, X., Dou, D.: Temporal relational modeling with self-supervision for action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2729–2737 (2021)
  53. Ishikawa, Y., Kasai, S., Aoki, Y., Kataoka, H.: Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2322–2331 (2021)
  54. Chen, L., Li, M., Duan, Y., Zhou, J., Lu, J.: Uncertainty-aware representation learning for action segmentation. In: IJCAI, vol. 2, p. 6 (2022)
  55. Du, Z., Wang, Q.: Dilated transformer with feature aggregation module for action segmentation. Neural Processing Letters, 1–17 (2022)